ABSTRACT

Aims. The aim of this work is to develop a comprehensive method for classifying sources in large sky surveys, and we apply the techniques to
the VIMOS Public Extragalactic Redshift Survey (VIPERS). Using the optical (u*, g', r', i') and NIR data (z', $K_s$), we develop a classifier for
identifying stars, AGNs and galaxies, improving the purity of the VIPERS sample.

Methods. Support Vector Machine (SVM) supervised learning algorithms allow the automatic classification of objects into two or more classes 
based on a multidimensional parameter space. In this work, we tailored the SVM for classifying stars, AGNs and galaxies, and applied this 
classification to the VIPERS data. We train the SVM using spectroscopically confirmed sources from the VIPERS and VVDS surveys.

Results. We tested two SVM classifiers and concluded that including NIR data can significantly improve the efficiency of the classifier. The
self-check of the best optical + NIR classifier has shown a 97% accuracy in the classification of galaxies, 97% for stars, and 95% for AGNs in
the 5-dimensional colour space. In the test on VIPERS sources with 99% redshift confidence, the classifier gives an accuracy equal to 94% for
galaxies, 93% for stars, and 82% for AGNs. The method was applied to sources with low quality spectra to verify their classification, thus
increasing the security of measurements for almost 4 900 objects.

Conclusions. We conclude that the SVM algorithm trained on a carefully selected sample of galaxies, AGNs, and stars outperforms simple 
colour-colour selection methods, and can be regarded as a very efficient classification method particularly suitable for modern large surveys. 
1. Introduction

Over the years, the amount of astronomical data collected by
satellites and ground-based surveys has been steadily increasing.
The zoo of collected data, such as photometry, redshifts, spectral
lines, and morphology, is constantly expanding, and increasingly,
researchers are turning to automated algorithms to explore the
high-dimensional parameter space. Although computationally
challenging, the goal is to make use of every available feature to
recognise and extract the most discriminating patterns and allow
the full systematisation of the data.

Furthermore, the study of the dependence of galaxy proper- 
ties on physical parameters such as galaxy mass or environment 
can greatly benefit from the efficient classification of sources. 
The classification of different types of sources is one of the basic 
and at the same time crucial tasks to perform before moving into 
any scientific analysis. 

The first physical classification of sources in a photomet- 
ric sky survey is between foreground stars within the Galaxy 
and extragalactic sources. Generally, the distinction between 
stars and galaxies can be made based upon morphological 
measurements; point sources are classified as stars while
extended sources are classified as galaxies (e.g. Vasconcellos
et al. 2011; Henrion et al. 2011). At bright apparent magnitudes
morphology seems to be a sufficient criterion for classification.
However, resolved stellar selection in the current and next
generation of wide-field surveys, like Euclid (Laureijs et al.
2012), BigBOSS (Schlegel et al. 2009), DES (Abbott et al. 2005),
LSST (Ivezić et al. 2009), LAMOST (Bland-Hawthorn), Pan-STARRS
(Kaiser et al. 2010), and/or deep surveys, like VUDS (Le Fèvre
et al. 2013, in preparation), HUDF (Beckwith et al. 2006), DLS
(Wittman et al. 2002), VISTA (Emerson & Sutherland 2010), is
being challenged by the vast number of unresolved galaxies at
faint apparent magnitudes (Fadely et al. 2012).

K. Malek et al.: VIPERS: A Support Vector Machine classification of galaxies, stars and AGNs

In the case of fainter sources, colour-colour diagrams are 
the most widely used tools to separate different classes of ce- 
lestial sources from one another since different types of objects 
will appear in different colour regions in such diagrams due to 
the shape of the Spectral Energy Distribution (SED). For example,
galaxies possess much redder colours relative to stars due to the
higher flux at longer wavelengths (e.g. Walker et al. 1989).
Classification methods based on colour-colour selection were
employed for star-galaxy separation (e.g. the infrared colour
diagram used by Pollo et al. 2010) or for finding special classes
of sources, e.g. high/low redshift quasars, active galactic
nuclei, starburst galaxies, or variable stars (Richards et al.
2002; Stern et al. 2005, 2012; Chiu et al. 2005; Brightman &
Nandra 2012; Woźniak et al. 2004).



Support Vector Machines (SVMs) are a class of supervised learning
algorithms, created as an extension to nonlinear models of the
generalised portrait algorithm developed by Vladimir Vapnik
(Vapnik 1995), for classification in a multidimensional parameter
space. These algorithms are based on the concept of decision
planes to classify objects using their relative positions in the
n-dimensional parameter space. A large number of observed
properties may be analysed simultaneously by the classifier,
making full use of the data. Within the full parameter space, it
is possible to build a more reliable classifier than is possible
by only using a subset of the data (for example, by analysing only
two photometric colours, instead of the complete set). On the
other hand, the method requires a Training Sample, that is, a set
of data that have known classifications. Generally, SVM algorithms
are sensitive to the measurement errors and are of limited use for
extracting information from noisy data sets (Fadely et al. 2012).

The classification of observed sources in astronomy is a
fundamental problem, and there is still no approach completely
free of drawbacks; however, SVM algorithms are a novel and very
promising classification strategy.

In this paper we apply the SVM algorithm to photometric data.
Previous works (e.g., Fadely et al. 2012; Solarz et al. 2012;
Vasconcellos et al. 2011; Ball et al. 2006) show the high
efficiency of that approach for two classes of objects (galaxies
and stars). Recently, the Photometric Classification Server (PCS)
for the prototype of the Panoramic Survey Telescope and Rapid
Response System (Pan-STARRS1), based on Support Vector Machines,
was developed (Saglia et al. 2012). The PCS system uses five
photometric bands, $g_{P1}$, $r_{P1}$, $i_{P1}$, $z_{P1}$, and
$y_{P1}$, and is able to separate three groups of sources (stars,
galaxies, QSOs), without any pre-selection based on colour or
redshift range, with high accuracy of galaxy classification
(~97%). The purities of the stellar and QSO samples are worse, at
the level of 85% and 83%, respectively.

We decided to develop a three-class recognition algorithm, which
will be able to classify galaxies/AGNs/stars based on the
photometric data in the Canada-France-Hawaii Telescope Legacy
Survey (CFHTLS). As a training set in colour space we used objects
with the best quality spectra from the VIMOS Public Extragalactic
Redshift Survey (VIPERS) and the VIMOS VLT Deep Survey (VVDS) Deep
(F02 field) and Wide (F22 field) data. After careful selection of
objects from VIPERS by SVM, and defining characteristic patterns
for different types of sources, it will be possible to enlarge the
sample of galaxies to be used for more detailed studies. We plan
to use this trained classifier on a large number of sources
possessing low quality spectra within VIPERS to recover sources
that cannot be classified based upon the spectrum alone. A
majority of objects with lower quality spectral information are
absorption line systems with low signal-to-noise ratio. Faint red
stars and faint passive galaxies are often difficult to
distinguish based on their spectral features if the quality of a
spectrum is low. Reconfirmation of the class of such an object
(galaxy, AGN or star) by the SVM classifier, based upon the
photometric measurements, also increases the probability that its
spectroscopically measured redshift is correct.

The paper is organised as follows. In Section 2 we describe the
data used in our analysis, both spectroscopic and photometric.
Section 3 describes the principles of the SVM learning algorithm.
In Section 4 we introduce the Training Sample used in our work. In
Section 5 we compare the efficiency of the classifier with and
without near-infrared data. Additionally, we present the results
of the basic tests of the classifiers: the self-check and the test
of the classifier on the VIPERS galaxies with redshift
measurements confirmation level equal to 95%. The section closes
with the selection of the optimal classifier used for our
subsequent analysis. Section 6 describes the results of applying
the optical + near-infrared SVM classifier to objects from the
VIPERS samples. Finally, in Section 7 we discuss the advantages
and limitations of our current SVM classifier and outline planned
improvements.



2. Data 

2.1. Photometric data 
CFHTLS photometry 

The CFHTLS, a joint Canadian-French programme, has three distinct
survey components: (1) the SuperNovae Legacy Survey, the "Deep"
survey, (2) the "Wide", a wide synoptic survey (on which the
VIPERS survey was based), and (3) a very wide shallow survey, the
"Very Wide".

The heart of MegaPrime, the wide-field optical imaging facility,
is the MegaCam CCD camera (Boulade et al. 2000). MegaCam provides
multi-colour photometry with wavelength (λ) coverage from 3500 to
9400 Å. The main characteristics of the MegaPrime/MegaCam broad
band filters are described in Tab. 1. For a more detailed
description we refer the reader to the CFHTLS official web page,
http://www.cfht.hawaii.edu/Science/CFHTLS/. All magnitudes used in
this paper are in the AB photometric system.



Table 1. MegaPrime* and WIRCam** filter characteristics.

Filter                   u*      g'      r'      i'      z'      K_s
Central λ (nm)           374     487     628     777     1170    2146
Bandwidth (nm)           76      145     122     151     687     325
Max. transmission (%)    77.5    93.5    96.3    98      95      98
Mag. limit***            25.30   25.50   24.80   24.48   23.60   22.00

* http://www.cfht.hawaii.edu/Instruments/Filters/megaprime.html
** http://www.cfht.hawaii.edu/Instruments/Filters/wircam.html/
*** measured as the 50% completeness limit (MegaPrime) and 5σ (WIRCam) for point sources.

The data used in this work are a part of the CFHTLS T0005 release
(Mellier et al. 2008), produced at the TERAPIX¹ data centre. We
consider a subsample of the CFHTLS T0005 catalogue with
spectroscopic redshifts measured by VIPERS.

WIRCam data 

In our work, we also used near-infrared $K_s$ measurements, in the
AB magnitude system, corrected for Galactic extinction, taken with
the Wide-field InfraRed Camera (WIRCam; Thibault et al. 2003;
Puget et al. 2004) and coming from the dedicated follow-up
observations for the VIPERS project (Arnouts et al. 2013, in
preparation). The $K_s$ filter has a central wavelength of
2146 nm and a maximum transmission at the level of 98%. A detailed
description of the WIRCam detector can be found on the WIRCam CFHT
web page².

2.2. Spectroscopic data 
VIPERS survey 

The VIMOS Public Extragalactic Redshift Survey (see
http://vipers.inaf.it) is an ongoing Large Programme aimed at
measuring redshifts for $\sim 10^5$ galaxies at redshift
$0.5 < z < 1.2$, to accurately and robustly measure clustering,
the growth of structure (through redshift-space distortions) and
galaxy properties at an epoch when the Universe was about half its
current age. The galaxy target sample is selected from optical
photometric catalogues of the Canada-France-Hawaii Telescope
Legacy Survey Wide (CFHTLS-Wide, Goranova et al. 2009; Mellier
et al. 2008). VIPERS covers $\sim 24$ deg$^2$ on the sky and is
divided into two areas within the W1 and W4 CFHTLS fields.
Galaxies are selected to a limit of $i_{AB} < 22.5$, measured
using a SExtractor mag_auto (Kron 1980)-like magnitude. In
addition, a simple and robust colour pre-selection in (g - r) vs
(r - i) is applied to efficiently remove galaxies at $z < 0.5$. In
combination with an efficient observing strategy (Scodeggio et al.
2009), this allows us to double the galaxy sampling rate in the
redshift range of interest with respect to a purely
magnitude-limited sample, reaching ~40%. At the same time, the
area and depth of the survey result in a fairly large volume,
$5 \times 10^7\,h^{-3}\,\mathrm{Mpc}^3$, analogous to that of the
2dFGRS at $z \sim 0.1$. This combination of sampling and depth is
quite unique among current redshift surveys at $z > 0.5$.
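
As an illustration, the magnitude limit and colour pre-selection can be written as a simple predicate. The sketch below is not the survey's targeting code: it assumes the colour cut in the form quoted in Sect. 4.1, (r'-i') > 0.5(u*-g') or (r'-i') > 0.7, together with i' < 22.5, and the function and argument names are hypothetical.

```python
def is_vipers_galaxy_target(u, g, r, i, i_limit=22.5):
    """Return True if AB magnitudes (u*, g', r', i') pass the assumed cuts.

    Hypothetical helper: the magnitude limit and colour cut follow the
    values quoted in the text; everything else is illustrative.
    """
    bright_enough = i < i_limit
    r_minus_i = r - i
    u_minus_g = u - g
    # A source passes the colour cut if either condition holds.
    colour_ok = (r_minus_i > 0.5 * u_minus_g) or (r_minus_i > 0.7)
    return bright_enough and colour_ok


# Invented magnitudes, for illustration only.
print(is_vipers_galaxy_target(u=24.0, g=23.0, r=22.0, i=21.0))
print(is_vipers_galaxy_target(u=22.0, g=21.0, r=20.5, i=20.4))
```

A source fainter than the i' limit is rejected regardless of its colours, which is why the magnitude test is kept separate from the colour test.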

VIPERS spectra are collected with the Visible imaging Multi-Object
Spectrograph (VIMOS, Le Fèvre et al. 2000) at moderate resolution
(R = 210), using the LR Red grism, providing a wavelength coverage
of 5500-9500 Å, for a typical redshift rms error of
$\sigma_z = 0.00047(1+z)$. The full VIPERS area of $\sim 24$
deg$^2$ is covered through a mosaic of 288 VIMOS pointings (192 in
the W1 area, and 96 in the W4 area). Of the VIPERS spectroscopic
targets, more than 51 000 $K_s$ counterparts were found: 96% (80%)
of our spectra for the W1 (W4) field have $K_s$ measurements. A
more detailed description of the WIRCam follow-up survey for the
VIPERS project can be found in Fritz et al. (2013) and Davidzon
et al. (2013).

The redshift quality is quantified at the time of validation by
attributing grading flags (VIPERS zflag) that are obtained from
repeated measurements of redshift for the same sources. The VIPERS
zflag for galaxies and stars ranges from a value of 4, indicating
>99% confidence that the measurement is secure, to 0, representing
the lack of a reliable redshift estimate. A VIPERS zflag equal to
9 corresponds to galaxies with only a single clear spectral
emission feature. Objects classified as AGNs follow the same
scheme but their flags are increased by 10. A similar system was
used and tested e.g. for the VVDS survey (Le Fèvre et al. 2005). A
discussion of the survey data reduction and management
infrastructure is presented in Garilli et al. (2012). An early
subset of the spectra used here has been analysed and classified
through a Principal Component Analysis (PCA) in Marchetti et al.
(2012). A more complete description of the survey construction,
from the definition of the target sample to the actual spectra and
redshift measurements, is given in the parallel survey description
paper, Guzzo et al. (2013).
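
The flag scheme described above can be sketched as a small decoder. This is an illustrative helper only, assuming integer flags and only the rules stated in the text (values 0 to 4 and 9, with 10 added for AGNs); the function names are hypothetical and any finer detail of the real VIPERS flags is not modelled.

```python
def decode_zflag(zflag):
    """Split a redshift flag into (is_agn, base_flag).

    Assumption: AGN flags are the galaxy/star flags increased by 10,
    so any flag of 10 or more is treated as an AGN flag.
    """
    value = int(zflag)
    is_agn = value >= 10
    base_flag = value - 10 if is_agn else value
    return is_agn, base_flag


def is_most_secure(zflag):
    """True for the highest-confidence flag (base flag 4)."""
    _, base_flag = decode_zflag(zflag)
    return base_flag == 4


print(decode_zflag(14))
print(is_most_secure(4), is_most_secure(9))
```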

The data set used in this paper is that of the early science data
release of VIPERS, as described in Guzzo et al. (2013); see also
de la Torre et al. (2013), Fritz et al. (2013), Marulli et al.
(2013), Bel et al. (2013) and Davidzon et al. (2013). These data
will be publicly available in fall 2013 as the VIPERS Public Data
Release 1 (PDR-1) catalogue. This catalogue includes 55,358
redshifts and corresponds to the reduced data as they were in the
VIPERS database at the end of the 2011/2012 observing campaign.

¹ http://terapix.iap.fr/
² http://www.cfht.hawaii.edu/Instruments/Imaging/WIRCam/

Using an automatic source classifier for VIPERS data is a natural
step to handle this unique data volume. Automated and efficient
source classifiers based on photometric observations can provide
class labels for catalogues and be used to recover objects for
study according to various criteria. Moreover,
a multilevel SVM classifier, trained to search for specific types 
of sources such as Active Galactic Nuclei (AGNs) or galaxies, 
with an additional redshift measurement as a feature in the pa- 
rameter space, can be used to boost confidence in the reliability 
of redshift estimates for sources with poor spectroscopic data. 
We are planning to develop a more sophisticated and detailed 
classifier in the near future, enlarging the parameter space by 
adding measurements of spectral lines and galaxy morphologi- 
cal parameters, thus enabling a finer classification of our sources 
(e.g. distinguish among different galaxy types). 

In this work, we used VIPERS data both to construct a Training
Sample and to select samples on which to apply the classifier to
separate three different classes of objects (galaxy/AGN/star).

VIMOS-VLT Deep Survey (VVDS) 

VIPERS was designed as an extragalactic survey aiming at ef- 
ficient measuring of redshifts for a large sample of galaxies. 
To increase the efficiency, stars were carefully removed from 
the target candidates (which was particularly important for the 
W4 VIPERS field due to its low galactic latitude). To this 
aim, both morphological and spectral energy distribution fitting
techniques were used (see Guzzo et al. 2013; Coupon et al. 2009).
However, it was also important to re-introduce AGNs, which were
identified among the stellar objects by their photometric
properties (a more detailed description of the AGN selection can
be found in the survey description paper, Guzzo et al. 2013).
Consequently, the number of observed stars and AGNs in VIPERS is
quite small.

To construct a reliable Training Sample (see Sec. 3), we included
data from another, similar but more complete survey, VVDS. The
VVDS fields, like VIPERS, are covered by CFHTLS (and partially by
WIRCam observations) and thus the photometric information is
homogeneous. Additionally, both surveys utilise the VIMOS
spectrograph in similar configurations. The VVDS spectroscopic
sample is based upon a purely magnitude-limited selection such
that the survey contains a much wider variety of sources than
VIPERS. We used the VVDS-Deep (F02 field) and VVDS-Wide (F22
field) surveys to construct a Training Sample of AGNs (objects
classified as AGNs by Gavignaud et al. 2007). The stellar sample
was chosen from a part of VVDS Wide F22 which overlaps the VIPERS
W4 field.

Fig. 1. An illustration of the operation of the SVM algorithm.
The input data (input space, on the left side) are transformed by
a kernel into the higher dimensional feature space (right side)
where, instead of having a complex boundary separating different
classes of objects, we can find an optimal separating hyperplane.

The Deep F02 survey, covering 0.49 square degrees, is a purely
magnitude-limited sample to $i_{AB} < 24$. The detailed
description of the VVDS Deep survey may be found in Le Fèvre
et al. (2005). The VVDS Wide F22 survey (Garilli et al. 2008),
covering an effective area of 3 square degrees, is also a
magnitude-limited survey, with a limit of $i_{AB} = 22.5$.

3. Method - Support Vector Machines 

The main purpose of the Support Vector Machine (SVM) is to
calculate decision planes between a set of objects having
different class memberships. A so-called Training Sample, a
training set of objects, is used to provide the SVM with examples
of the different classes of sources. The SVM searches for the
optimal separating hyperplane between the n different classes of
objects by maximising the margin between the classes' closest
points (the so-called support vectors). Instead of using a
probability function, as in Bayesian statistics or
template-fitting methods, the objects are classified based on
their relative position in the n-dimensional parameter space with
respect to the separation boundary. A well chosen Training Sample
is at the heart of the method because the classifier is tuned, and
the hyperplane between classes is determined, based on the
properties of the Training Sample.

The SVM algorithm represents a major development in machine
learning techniques. It can be applied to classification or
regression problems and is constantly growing in popularity as a
tool for astronomical data, where it is used to distinguish
different classes of sources based on a multidimensional space of
parameters taken from observations. Woźniak et al. (2004)
efficiently used SVMs to analyse variable sources in a
5-dimensional space constructed from the period, amplitude and
three colours; Huertas-Company et al. (2008) quantified the
morphologies of near-infrared galaxies based on a 12-dimensional
space, including 5 morphological parameters and other
characteristics of galaxies such as luminosity and redshift;
Solarz et al. (2012) created a star-galaxy separation algorithm
based on mid- and near-infrared colours; Saglia et al. (2012)
separated three different classes of sources (galaxies, QSOs, and
stars) from the Pan-STARRS1 survey, based on five photometric
bands. The last year brought a significant number of astronomical
papers implementing supervised machine learning algorithms to
handle various tasks, not only to classify sources but also to
predict characteristic features of specific objects. For example,
Peng et al. (2012) used SVM to select Active Galactic Nuclei (AGN)
candidates and to estimate redshift, and Hassan et al. (2013) used
it to search for specific AGN subclasses, BL Lacertae objects and
flat-spectrum radio quasars, based on the Second Fermi LAT
Catalogue. Clearly SVMs present an innovative method with a great
potential to be widely used in many different branches of
astronomy, a potential we are just beginning to tap into.

We use the SVM algorithm to build a non-linear classifier 
for photometric data to select three different classes of objects: 
galaxies, AGNs and stars. The first step of our classification task 
involves selecting a secure Training Sample of galaxies, AGNs 
and stars, taking advantage of the redshift information provided 
by VIPERS and VVDS, and using their attributes - i.e. their ob- 
served photometric fluxes - to train the SVM. 

The algorithm, aided by a non-linear kernel function, searches for
a hyperplane which will maximise the distance from the boundary to
the closest points belonging to the separate classes of objects
(Cristianini & Shawe-Taylor 2000; Shawe-Taylor & Cristianini
2004). The kernel is a symmetric function
$k: X \times X \to \mathbb{R}$ such that, for all $x_i$ and $x_j$,
$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$, where $\Phi$
maps the input space $X$ to the feature space $F$ (see Fig. 1).
For our analysis we chose a Gaussian radial basis kernel (RBK)
function, defined as:

\[ k(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right), \tag{1} \]

where $\|x_i - x_j\|$ is the Euclidean distance between $x_i$ and
$x_j$. The result of the kernel function is a non-linear
representation of each parameter from the input to the feature
space. The RBK is one of the most popular SVM kernel functions,
used to make the non-linear feature map. We decided to use it
because of its effectiveness and simplicity in adjusting the free
parameters.
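
To make the kernel of Eq. (1) concrete, the following minimal sketch evaluates it for two feature vectors in plain Python. The function name and the example vectors are hypothetical; in practice the kernel is evaluated internally by the SVM library.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Gaussian radial basis kernel: exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


# Two invented 2-dimensional colour vectors, for illustration only.
a = [0.0, 0.0]
b = [1.0, 1.0]
for gamma in (0.1, 0.5, 2.0):
    # Larger gamma makes the similarity fall off faster with distance.
    print(gamma, rbf_kernel(a, b, gamma))
```

The kernel equals 1 for identical vectors and decreases towards 0 as the distance grows, at a rate set by γ.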

A schematic representation of the SVM algorithm classification
process is shown in Fig. 2.

For our tasks, we used a soft-boundary SVM method called C-SVM. We
chose C-classification because of its good performance and only
two free parameters:

- C - a trade-off parameter that sets the width of the margin
separating different classes of objects. A large C value sets a
small margin of separation between different classes of objects;
however, increasing the C parameter too much can lead to
over-fitting. Reducing C will make the hyperplane between
different classes of objects smoother, allowing for some
mis-classifications.

- γ - a parameter (related to the kernel function) that determines
the topology of the decision surface. For a kernel of the form
exp(-γ ||x_i - x_j||^2), a high value of γ sets a very tightly
fitted, complicated decision boundary; a value of γ that is too
low can give a very smooth decision surface, causing
mis-classifications.

For our analysis we used LIBSVM³ (Chang & Lin 2011), an integrated
software for support vector classification, which allows for
multi-class classification. We used R⁴, a free software
environment for statistical computing and graphics, with the e1071
interface package (Meyer 2001) installed.
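
LIBSVM handles more than two classes with a one-against-one scheme: a binary classifier is trained for every pair of classes and the final label is decided by majority vote. The sketch below illustrates only the voting step in plain Python. The pairwise decision functions are hypothetical stand-ins (simple thresholds invented for the example), not trained SVMs and not part of the e1071 interface.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise_decide):
    """Label x by majority vote over all pairwise binary decisions.

    pairwise_decide maps a class pair (a, b) to a function that
    returns either a or b for the feature vector x.
    """
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise_decide[pair](x)] += 1
    # Ties are not treated specially in this sketch.
    return votes.most_common(1)[0][0]


classes = ("galaxy", "AGN", "star")
# Invented toy rules standing in for the three pairwise classifiers.
toy_deciders = {
    ("galaxy", "AGN"): lambda x: "galaxy" if x[0] > 1.0 else "AGN",
    ("galaxy", "star"): lambda x: "galaxy" if x[1] > 0.5 else "star",
    ("AGN", "star"): lambda x: "AGN" if x[0] < 0.5 else "star",
}

print(one_vs_one_predict((1.5, 0.8), classes, toy_deciders))
print(one_vs_one_predict((0.2, 0.1), classes, toy_deciders))
```

For three classes this requires three pairwise classifiers, which is the setting of the galaxy/AGN/star problem considered here.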

³ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
⁴ http://www.r-project.org/
Fig. 2. A schematic representation of the SVM algorithm
classification process. We take as input the pre-selected Training
Sample consisting of (in the case of this work) three distinct
classes of objects. The SVM is taught how to distinguish one class
from the others based on the discriminating properties chosen as
feature vectors. Then, the classifier is trained by tuning the
free parameters (C and γ). If the result reaches a high enough
accuracy rate (the number of objects from the Training Sample that
are correctly recognised by the classifier) without overfitting
(the resultant hyperplane does not confine too tightly to the
sources of a specific type), it will be used to classify the
unknown objects (test sample). If the accuracy is not
satisfactory, a different parameter space (or Training Sample, if
possible) is chosen to tune C and γ. After a number of iterations,
which allow the classifier to reach a high enough efficiency
level, a real sample can be classified using the discriminant
hyperplanes.

4. Training Sample

The successful application of an SVM algorithm requires a
carefully selected Training Sample - a set of objects with
confirmed classes which will serve as a template for
distinguishing the sources whose class we want to determine. Since
this work is focused on the selection of galaxies, AGNs, and
stars, we select as a Training Sample a set of sources whose basic
class (galaxy, AGN or star) was established with the highest
reliability thanks to their high quality spectra (their redshift
being measured with the highest confidence flag within the VIPERS
or VVDS surveys). For these sources the accurate photometric
information provided by the CFHTLS Wide survey and the WIRCam
follow-up observations of the VIPERS/VVDS fields provided the
colour information needed to create the discriminant vectors for
training our SVM algorithm. We produced a model (the optimised C
and γ parameters based on the training data) which predicts the
target values of the test data given only the test data attributes
(Hsu et al. 2010).

4.1. Galaxies

As a galaxy Training Sample we used the sources with the best
redshift measurements in both the W1 and W4 VIPERS fields (VIPERS
zflag = 4, corresponding to the highest confidence level of
redshift measurements and thus of spectroscopic classification as
a galaxy). It is useful to remember that VIPERS is pre-selected
not only in magnitude (i' < 22.5) but also in colours:
(r'-i') > 0.5(u*-g') or (r'-i') > 0.7. We have divided the galaxy
training set into i'-based apparent magnitude-binned samples and
trained the classifier on each subset. As a galaxy Training Sample
we used 16 271 galaxies: 1 884, 5 483, 6 778, and 3 226 for the
19<i'<20, 20<i'<21, 21<i'<22, and 22<i'<22.5 apparent magnitude
bins, respectively. Based on our initial tests, we decided to
divide our galaxy sample into magnitude bins to separate more
efficiently the different groups of galaxies seen in different i'
apparent magnitude ranges and so improve their classification.
Fig. 3 shows that galaxies in different magnitude bins occupy
different areas of the colour-colour plots, partially because of
their different redshift ranges and different morphologies.

4.2. AGNs

Given the small number of AGNs detected in the VIPERS fields with
VIPERS zflag = 14, we increased the AGN sample by using all AGNs
which had at least 99% confidence level of spectroscopic
classification (VIPERS zflag 13 and 14, in total 398 objects). AGN
spectra are quite easy to recognise and thus a lower flag on the
quality of the measured redshift does not infringe on the
reliability of the classification as an AGN. There are two ways
that an AGN can be observed in VIPERS:

- it is star-like and meets the AGN candidate selection. This
includes samples of X-ray selected AGNs from the XMM-LSS survey,
overlapping the VIPERS W1 field (Pierre et al. 2004), and AGNs
selected based upon colour-colour criteria from the sample of
star-like sources that would otherwise not be targeted.

- it meets the galaxy selection criteria - AGNs which met the
galaxy criteria during the main VIPERS colour pre-selection.

We stress that the colour pre-selection for galaxies and AGNs is
slightly different, and AGNs occupy only a part of the full
colour-colour galaxy plane. The first AGN colour separation
criterion, CC1AGN, is:



(g'-r')<lA 



1. (U* - g) CO rr < 0.6, 

0.6<(k*-sW<1.2& 
(£'-r'W>0.5( M *-£') ( 
0.6 < (ii* - g) corr < 2.6 & 

3 ' (#' - r')« 
4. (ii* 



(M* " g')corr + 0.036, 
)<(«*- g)corr < 2.6 & W 

- r') corr < 0.5(w* -g') CO rr + 0.214, 

- g')corr > 2.6, 



where $(u^*-g')_{\mathrm{corr}}$ and $(g'-r')_{\mathrm{corr}}$
correspond to the colours corrected for the tile colour offset.
This set of colour-colour selection criteria was inspired by the
VVDS survey. After one year of observations it turned out that
this allows for stellar contamination at the level of ~60%. From
August 2010 an additional criterion, CC2AGN, defined in the
(g'-i') vs (u*-g') colour-colour plane, was added to eliminate
stars from the AGN targets. The set of colour-colour criteria
included in CC2AGN is listed below:

\[
\begin{array}{ll}
1. & (u^*-g')_{\mathrm{corr}} < 0.6 \;\&\; -0.2 < (g'-i') < 1, \\
2. & 0.6 < (u^*-g')_{\mathrm{corr}} < 1 \;\&\; -0.2 < (g'-i') < 0.2, \\
3. & (u^*-g')_{\mathrm{corr}} > 1 \;\&\; (g'-i') < 0.6.
\end{array}
\tag{3}
\]

Therefore, both criteria (Eq. 2 and Eq. 3), applied
simultaneously, defined the VIPERS AGN targets. However, most of
the AGNs share the same colour-colour space as galaxies (as can be
seen in Fig. 6), and that is why it is a challenge to distinguish
all three classes of objects using an automatic classifier.

Fig. 3. The representative colour-colour plots for the galaxy Training Sample (panel axis labels: u*-i', u*-g', u*-r'). Open black squares represent objects with i' apparent
magnitude between 19 and 20 mag; green crosses (x) mark galaxies with i' magnitude between 20 and 21 mag; objects with i' apparent
magnitude in the ranges 21<i'<22 and 22<i'<22.5 mag are marked as blue plus signs (+) and open red triangles, respectively. In the middle panel
of the colour-colour plots, the boundaries of the VIPERS selection are marked as magenta lines.

Fig. 4. The representative colour-colour plots for the AGN Training Sample. Full magenta triangles represent objects brighter than
19 mag in the i' band. Open black triangles mark AGNs with i' apparent magnitude between 19 and 20 mag; open green circles mark AGNs
with i' magnitude between 20 and 21 mag; objects with i' apparent magnitude in the ranges 21<i'<22 and 22<i'<22.5 mag are marked
as open blue squares and open red diamonds, respectively; AGNs with i' apparent magnitude fainter than 22.5 are marked as open
rotated cyan triangles.

Fig. 5. The representative colour-colour plots for the star Training Sample (panel axis labels: u*-i', u*-g', u*-r'). Open black triangles mark stars with i' apparent magnitude
between 19 and 20 mag; open green circles mark stars with i' magnitude between 20 and 21 mag; objects with i' apparent magnitude
in the ranges 21<i'<22 and 22<i'<22.5 mag are marked as open blue squares and open red diamonds, respectively.
Fig. 6. Representative colour-colour plot (horizontal axis: u*-g')
for VIPERS galaxies with VIPERSzflag = 4 (pink crosses) and AGNs
with VIPERSzflag = 3 and 4 (open blue circles). Part of the AGNs
occupy a different colour-colour area than galaxies, and for them
the galaxy/AGN separation is not so difficult. For objects
classified as AGNs lying in the same region of the colour-colour
plane, the galaxy/AGN/star separation is more challenging. For
this reason we decided to use an SVM with a multi-dimensional
photometric parameter space to classify sources with similar
properties in the typical colour-colour plane.



To enlarge the AGN Training Sample, we also merged the
VIPERS sample with objects classified as broad-line AGNs in
the VVDS survey. In our Training Sample we included AGNs
identified by Gavignaud et al. (2007), a catalogue of broad
emission-line AGNs from the purely flux-limited spectroscopic
sample of the VVDS survey. No colour-based pre-selection has
been applied in the case of these AGNs. For our studies we used
100 AGNs from the VVDS Deep F02 (Le Fèvre et al. 2005) and
VVDS Wide F22 (Garilli et al. 2008) fields only. We selected
these fields since they have the same CFHTLS photometric
system as the VIPERS survey.

Cumulatively, our AGN Training Sample reached 498 objects.
Part of them, observed by VIPERS, are colour pre-selected, while
the AGNs from the VVDS fields have no colour pre-selection
(they are flux-limited only). As we checked on colour-colour
plots in the different magnitude bins (see Fig. 4), we do not see
a change in the population of our AGN sample with apparent
luminosity. For this reason, unlike in the case of the galaxy
sample, we decided not to divide the AGN Training Sample into
i'-based apparent magnitude bins, but to use it as a whole in
each bin to increase the population of the training AGNs.



4.3. Stars 

VIPERS performed a star/galaxy classification in the CFHTLS
wide fields to effectively remove stars from the sample of
observed targets. This procedure is crucial since at i'<22.5
the fraction of stars can be as high as 50% (as in the case
of W4, Guzzo et al. 2013). The basic VIPERS classification
procedure was based on the colour-colour pre-selection with
(r - i) > 0.5 * (u - g) or (r - i) > 0.7, but due to the low
galactic latitude of the W4 field, VIPERS implemented an
additional procedure. We refer the interested reader to
Guzzo et al. (2013) for a complete description of the adopted
strategy; here it is sufficient to mention that for objects
brighter than i'=21 an additional pre-selection based on the
observed angular size of sources was applied, while for objects
fainter than i'=21 a combined method making use of the angular
size and SED fitting by the Le Phare code (Arnouts et al. 1999;
Ilbert et al. 2006) has been used. These pre-selection criteria
proved to be very effective. However, the average stellar
contamination in the VIPERS database, for both fields, remains
at the level of 3.2% (1.49% and 4.86% for the W1 and W4 fields,
respectively). This means that in the VIPERS PDR-1 catalogue,
which includes 55 358 objects, 1 750 objects (3.20%) have been
identified as stars, although they were initially classified as
galaxies with colours compatible with an object at z>0.5. This
stellar sample can be divided into two main groups:

- stars which were not distinguishable from galaxies based on 
the VIPERS pre-selection criteria, and 

- stars, which were included in the sample as AGN candidates. 
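
The basic colour-colour pre-selection quoted above, (r - i) > 0.5 * (u - g) or (r - i) > 0.7, can be written as a one-line test. The following is only a minimal Python sketch: the function name and the example magnitudes are hypothetical, and the additional size-based and SED-based star/galaxy criteria described by Guzzo et al. (2013) are not included.

```python
def passes_colour_preselection(u, g, r, i):
    """Return True if an object satisfies the basic colour-colour cut
    (r - i) > 0.5 * (u - g) or (r - i) > 0.7 quoted in the text."""
    return (r - i) > 0.5 * (u - g) or (r - i) > 0.7


# Hypothetical magnitudes, for illustration only.
red_object = passes_colour_preselection(u=24.0, g=23.5, r=22.5, i=21.6)   # r-i ~ 0.9
blue_object = passes_colour_preselection(u=22.0, g=21.0, r=20.8, i=20.7)  # r-i ~ 0.1, u-g = 1.0
```

Here the first object passes the cut and the second does not; real use would apply the test to the catalogue magnitudes after any corrections adopted by the survey.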

Then, it should be stressed that the stars observed by VIPERS are 
interlopers within the galaxy and AGN samples and thus are not 
representative of the stellar class. However, our method uses the 
multi-dimensional colour space which opens a possibility that in 
such a space, these sources may occupy a region separated from 
galaxies and AGNs. 

To build an unbiased star Training Sample we added
spectroscopically classified stars from the overlap of the VVDS
Wide F22 field with the VIPERS W4 field. VVDS Wide F22
observations were carried out to the same magnitude limit as
VIPERS, but without any photometric pre-selection. The overlap
between the VVDS Wide F22 and VIPERS W4 fields contains
920 objects spectroscopically classified as stars by VVDS in
the 19<i'<22.5 apparent magnitude range. We increased the
stellar Training Sample by using all VIPERS stars with
VIPERSzflag equal to 4 in the same apparent magnitude range
(1 312 objects). Cumulatively, our stellar Training Sample
reached 2 232 objects.

Similarly to the case of the AGN Training Sample, we did
not divide the stellar Training Sample into i'-based apparent
magnitude bins. As shown in the representative colour-colour
plots for the different i' magnitude bins (Fig. 5), we did not
observe a significant change in the distribution of our stellar
sample as a function of apparent luminosity.

4.4. Oversampling 

Our Training Sample includes more than 16 thousand galaxies,
and only 2 232 stars and 498 AGNs (Fig. 7 shows the
representative colour-colour plots for galaxies, AGNs and stars
chosen for the best Training Sample set). Sampling strategies,
such as oversampling and undersampling, are a popular solution
to this problem, because the SVM classifier is sensitive to a
high class imbalance, which results in a drop in the
classification performance (e.g. Tang et al. 2009; Akbani et al.
2004; Raskutti & Kowalczyk 2004). A classifier trained on an
unbalanced training set tends to over-predict the majority class
for unknown sources (Tian et al. 2011).

To avoid this effect, we performed an oversampling of the
AGN and stellar training sets so that in each considered
magnitude bin we had a similar effective number of objects
classified as galaxies, AGNs and stars. In fact, despite our
decision not to split the AGN and star classes into magnitude
bins, as we did in the case of galaxies, the imbalance between
the numbers of representatives in each class remains high.
Table 2. Number (N) of galaxies, AGNs, and stars in our
Training Sample after oversampling.

              19<i'<20   20<i'<21   21<i'<22   22<i'<22.5
N Galaxies      1 884      5 483      6 778      2 126
N AGNs          1 520      4 440      5 440      1 760
N Stars         2 232      4 440      5 440      2 232

Using a simple oversampling technique we raised the effective
number of AGNs and stars up to ~80% of the number of
galaxies in each magnitude bin considered.

We therefore added in each magnitude bin a number of
artificial objects calculated as:

$\lceil X_{\rm missing} \rceil_{10} = N_{G} \times 0.8 - X$,   (4)

where $X_{\rm missing}$ is the number of missing objects (AGNs,
stars), $N_{G}$ is the number of galaxies in the bin, $X$ is the
number of real objects of the given class, and the symbol
$\lceil\ \rceil_{10}$ corresponds to rounding the value up to the
nearest 10. The additional artificial objects were created by
shifting the observed magnitudes by an amount drawn from a
Gaussian distribution with σ=0.05. We also checked how the
stellar and AGN Training Samples work if we do not perturb
the colours, but instead duplicate real objects multiple times.
As might be expected, the results of the classifiers were worse
than with randomly modified stars and AGNs.

Tab. 2 summarizes the numbers of training galaxies in each
magnitude-binned set together with the number of AGNs and
stars after oversampling.
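
The oversampling recipe described above (Eq. 4 followed by a Gaussian perturbation of the magnitudes) can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the paper: the function names and example numbers are hypothetical, returning zero when a class already exceeds the 80% target is an assumption, and so is the choice to perturb each band independently.

```python
import math
import random


def n_missing(n_galaxies, n_real, fraction=0.8, step=10):
    """Eq. (4): number of artificial objects, N_G * 0.8 - X,
    rounded up to the nearest 10."""
    raw = fraction * n_galaxies - n_real
    if raw <= 0:
        return 0  # assumption: nothing is added if the target is already reached
    return int(math.ceil(raw / step) * step)


def oversample(magnitudes, n_new, sigma=0.05, seed=0):
    """Create n_new artificial objects by adding Gaussian noise (sigma)
    to the magnitudes of randomly chosen real objects.
    Assumption: every band is perturbed independently."""
    rng = random.Random(seed)
    artificial = []
    for _ in range(n_new):
        source = rng.choice(magnitudes)
        artificial.append([m + rng.gauss(0.0, sigma) for m in source])
    return artificial


# Hypothetical numbers: 1000 galaxies and 295 real AGNs in a bin.
to_add = n_missing(1000, 295)                            # 505 rounded up to 510
fake = oversample([[22.1, 21.4, 20.9, 20.5]], n_new=3)   # three perturbed copies
```

Perturbing the magnitudes rather than copying objects verbatim follows the text, which reports that plain duplication gave worse classifiers.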

5. Results 

5.1. Training procedure 

In order to build a classifier which will be able to separate
different classes of objects, it is necessary to tune the C and γ
parameters using the Training Sample. For the best performance,
we have performed a grid-search with values γ ∈ 10^(-3:-1)
and C ∈ 10^(0:3) using a 10-fold cross-validation technique. We
first divided the full Training Sample into 10 subsets of equal
size and selected 9 subsets to train the classification model and
test it against the remaining subset (the so-called self-check).
This test was repeated 10 times, with a different subset removed
for each training run. The classification accuracy was then
averaged over the 10 runs. This process was repeated for each
value of the parameters C and γ. In Fig. 8 we present a
representative plot of the grid-search, done for the apparent magnitude




Fig. 8. The mean mis-classification rate as a function of C and
γ (horizontal axis: log(γ)), as estimated from the 10-fold
cross-validation technique performed for each pair of parameters
(see text for more details). The lower the mis-classification
rate, the better the performance of the SVM algorithm.



bin 19<i'<20. The colour of each point of the grid codes the
mean mis-classification rate for the corresponding γ and C values
(in log scale on the X and Y axis, respectively). The
mis-classification rate is defined as (1 - Total Accuracy) for each
magnitude bin (see Eq. 6 further in the paper): the lower the
mis-classification rate, the better the performance of the SVM
algorithm. We would like to stress that a change of the parameter
space (such as adding more parameters describing properties of
sources) or a substantial change in the number of training
objects inside one class may alter the distribution of training
objects and therefore requires a re-calculation of the best
parameters.
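
The structure of such a grid-search with 10-fold cross-validation can be sketched as below. This is a schematic Python outline under stated assumptions: the grid uses integer exponents only (the step of the grid is not given in the text), `score_fn` is a hypothetical callback that would train an SVM with the given C and γ on the training indices and return its accuracy on the held-out indices, and `toy_score` is a stand-in used only to make the sketch runnable.

```python
import itertools
import math
import random


def k_fold_indices(n_objects, k=10, seed=0):
    """Shuffle object indices and split them into k nearly equal folds."""
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search(n_objects, score_fn, log_c=(0, 1, 2, 3), log_gamma=(-3, -2, -1), k=10):
    """Return the (log C, log gamma) pair with the lowest mean
    mis-classification rate, plus the rate for every grid point."""
    folds = k_fold_indices(n_objects, k)
    rates = {}
    for lc, lg in itertools.product(log_c, log_gamma):
        accuracies = []
        for held_out in range(k):
            test = folds[held_out]
            train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
            accuracies.append(score_fn(train, test, 10.0 ** lc, 10.0 ** lg))
        rates[(lc, lg)] = 1.0 - sum(accuracies) / k  # 1 - mean accuracy
    best = min(rates, key=rates.get)
    return best, rates


def toy_score(train, test, C, gamma):
    """Placeholder accuracy peaking at C = 100, gamma = 0.01 (not a real SVM)."""
    return 0.95 - 0.02 * (abs(math.log10(C) - 2.0) + abs(math.log10(gamma) + 2.0))


best, rates = grid_search(200, toy_score)
```

In a real application `score_fn` would wrap the SVM training and prediction step, and the whole search would be repeated for each magnitude bin.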

To check the efficiency of our classifiers, we count the true 
objects (true galaxies - TG, true AGNs - TAGN and true stars 
- TS from the Training Sample originally classified as galax- 
ies, AGNs and stars, respectively), and false objects: FG (false 
galaxy: when a source from the stellar or AGN Training Sample 
is classified as a galaxy by the SVM); FS (false star: when an 
object from a galaxy or AGN Training Sample is classified as a 
star by the SVM), and FAGN (false AGN: when an object from 
a galaxy or star Training Sample is classified as an AGN by the 
SVM). We then calculate the accuracy of our classifier based on
the formula:

$\mathrm{Accuracy} = \frac{TG + TAGN + TS}{TG + TAGN + TS + FG + FAGN + FS}$.   (5)

After completing the 10-fold cross-validation process we
calculated the total accuracy of the SVM classifier, defined as
the mean accuracy for all iterations:

$\mathrm{Total\ Accuracy} = \frac{\sum_{i=1}^{N} \mathrm{Accuracy}_i}{N}$,   (6)

where N = 10 is the number of validation iterations. We
performed such a check in each magnitude bin considered.
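
As an illustration of Eqs. 5 and 6, the short Python sketch below computes the accuracy from a table of true versus assigned classes and then averages it over validation runs. The dictionary layout, the function names, and all counts are hypothetical and serve only to show the arithmetic.

```python
def accuracy(confusion):
    """Eq. (5): confusion[true_class][svm_class] holds object counts.
    The diagonal gives TG + TAGN + TS; the off-diagonal terms are the
    false galaxies, false AGNs and false stars."""
    correct = sum(confusion[c][c] for c in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


def total_accuracy(per_run):
    """Eq. (6): mean accuracy over the N validation iterations."""
    return sum(per_run) / len(per_run)


# Hypothetical counts for one validation run.
run = {
    "galaxy": {"galaxy": 90, "agn": 4, "star": 6},
    "agn": {"galaxy": 10, "agn": 85, "star": 5},
    "star": {"galaxy": 8, "agn": 2, "star": 90},
}
acc = accuracy(run)                          # 265 correct out of 300
mean_acc = total_accuracy([0.8, 0.9, 1.0])   # mean of three hypothetical runs
```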

In our work for galaxy/AGN/star classification we used both
a three- and a five-dimensional colour space. The first one was
built using only optical data, corresponding to the (u*-g'),
(g'-r'), and (r'-i') colours, while the second one included NIR
data and thus used two extra colours: (i'-z') and (z'-Ks).
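
The two feature spaces can be built from the catalogue magnitudes as in the following minimal Python sketch. The function name, the dictionary keys, and the example magnitudes are hypothetical; only the list of colours is taken from the text.

```python
def colour_vector(mags, use_nir=True):
    """Build the 3-D optical or 5-D optical+NIR colour vector
    from a dictionary of magnitudes keyed by band name."""
    colours = [
        mags["u"] - mags["g"],
        mags["g"] - mags["r"],
        mags["r"] - mags["i"],
    ]
    if use_nir:
        colours += [mags["i"] - mags["z"], mags["z"] - mags["Ks"]]
    return colours


# Hypothetical object, for illustration only.
example = {"u": 23.5, "g": 22.5, "r": 22.0, "i": 21.5, "z": 21.25, "Ks": 20.5}
optical = colour_vector(example, use_nir=False)   # [u-g, g-r, r-i]
optical_nir = colour_vector(example)              # adds i-z and z-Ks
```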

5.2. Optical u*g'r'i' classifier 

We constructed colour-colour Training Samples without
near-infrared data, based only on the optical u*, g', r', and i'
filter bands (a three-dimensional hyperspace). We found that the
Total Accuracy as well as the number of correctly classified
objects for this approach depend on the apparent magnitude of
objects. Over all magnitude bins (19<i'<22.5), once we average
the results by the number of objects in each bin, the mean Total
Accuracy for the optical classifier is equal to 86.39%.

The results of the self-check of our classifier are shown in
Tab. 3: only in a few percent of cases (less than
11% in all magnitude bins) are galaxies classified as a star
or as an AGN. The most frequent mis-classifications occur in
the 19<i'<20 bin, in which galaxies are correctly classified at
the level of 88.82%, AGNs at 69.45%, and stars at the level of
78.79%. The mis-classifications between stars and galaxies are
noticeable in the first three bins. For the 20<i'<21 and 21<i'<22
bins more than 10% of spectroscopically classified stars are
classified by the SVM as false galaxies (15.06% and 10.01%,
respectively). In the same bins, AGNs are mis-classified as
galaxies at the level of 6.23% and 15.50%, respectively.

The mis-classification of galaxies and AGNs happens mainly 
in the bins where the percentage of oversampled objects in- 
creases. The reason may be related either to our oversampling 
method or to the worse accuracy of photometry for the fainter 
sources, as well as to intrinsic properties of classified sources in 
these bins. We stress that for the SVM method the 100% level 
of self-check is not desirable since it may be indicative of over- 
fitting. The boundaries between different classes of objects de- 
fined by the Training Sample may become too rigid and artifi- 
cially complex, not allowing for effective classification of real 
sources. Nevertheless, it seems that the present, very basic clas- 
sifier, created on the basis similar to the standard colour-colour 
approach, works well for our Training Sample. 

We next apply our trained classifier to VIPERS galaxies with
redshift quality flag VIPERSzflag = 3, corresponding to a
confidence of the redshift measurement (and correspondingly of
the correct identification as a galaxy) of > 99% (hereafter GAL3).
Tab. 5 shows that GAL3 are correctly classified at a level higher
than 85%, with an obvious decrease of success with apparent
luminosity. The percentage of mis-classifications is roughly
constant, at a level of ~10%. A strong contamination by false
stars is visible for objects fainter than i'=21 mag. It is
reassuring that this trend is similar to the self-check results
(Tab. 3), demonstrating that


Table 5. Test of the SVM optical classifier on the galaxies with
VIPERSzflag equal to 3. The first row shows the percentage
of correctly classified galaxies. The second and third rows show
the percentage of mis-classified galaxies: when a true galaxy is
classified by the SVM as an AGN or a star, respectively.

              19<i'<20   20<i'<21   21<i'<22   22<i'<22.5
Galaxies        90.97      91.41      85.38      88.82
False AGNs       2.76       2.81       3.06       4.45
False stars      6.27       5.78      11.56       6.73

Table 6. Test of the SVM classifier with near-infrared data on the
galaxies with VIPERSzflag equal to 3. The first row shows
the percentage of correctly classified galaxies. The second and
third rows show the percentage of mis-classified galaxies: when a
true galaxy is classified by the SVM as an AGN or a star,
respectively.

              19<i'<20   20<i'<21   21<i'<22   22<i'<22.5
Galaxies        95.38      95.17      93.09      92.72
False AGNs       2.42       2.72       4.30       5.29
False stars      2.20       2.11       2.61       1.99

the Training Sample is representative of the data. In the fainter 
magnitude bins the photometric errors increase such that the op- 
tical u*, g', r', and i' fluxes are not as efficient in distinguishing 
galaxies and stars. 

5.3. Optical+NIR (u*g'r'i'z'Ks) classifier

We enlarge the parameter space by adding colours based on the
NIR bands (z' and Ks) to our classifier (a five-dimensional
hyperspace). We performed the same tests as for the optical
classifier (self-check, and test on VIPERS GAL3).

Our Training Sample, composed of exactly the same sources
as for the optical classifier, but with NIR measurements, allows
us to train a new optical+NIR classifier. The mean Total Accuracy
for this classifier is equal to 94.29%, i.e. higher than for the
purely optical one. Total Accuracy for particular magnitude bins
stays at a similar level of ~95% for the whole i' apparent
magnitude binned sample. The stability of the new classifier for
objects fainter than 20 mag in the i' band is very promising for
the next tests and the final classification of VIPERS objects.

Tab. 4 shows the self-check for the u*, g', r', i', z', and Ks
classifier. Averaging over all magnitude bins, galaxies are
correctly classified in ~97.03%, AGNs in 95.13%, and stars in
97.05% of the cases, once we average the results by the mean
number of objects in each bin. All these numbers are significantly
higher than those for the purely optical classifier. In the case of
AGNs, the difference between correctly classified sources for the
optical and optical+NIR classifiers is equal to 26.46%, 5.46%,
13.30%, and 13.88% for the 19<i'<20, 20<i'<21, 21<i'<22, and
22<i'<22.5 apparent magnitude bins, respectively. Stars are
correctly classified at a higher level than AGNs, with a
difference between the optical and optical+NIR classifiers equal
to 17.58%, 16.79%, 10.94%, and 3.41% for the same magnitude bins.

Applying this classifier to VIPERS galaxies with
VIPERSzflag equal to 3 (GAL3, Tab. 6) shows that galaxies
are correctly classified at the very high mean level of
93.60% (we average the results by the number of objects in each
bin). Incorrect galaxy classifications, i.e. false AGNs and false
stars, are very rare, and do not exceed 2.65% for stars, and
5.30% for AGNs.
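
The count-weighted mean quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the function name is hypothetical, the per-bin percentages are the "Galaxies" row of Tab. 6, and the numbers of GAL3 objects per bin are those listed in Tab. 7.

```python
def weighted_mean(values, weights):
    """Mean of per-bin values weighted by the number of objects per bin."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


gal3_accuracy = [95.38, 95.17, 93.09, 92.72]   # percent, four i' bins (Tab. 6)
gal3_counts = [445, 3271, 7667, 2156]          # GAL3 objects per bin (Tab. 7)
mean_accuracy = weighted_mean(gal3_accuracy, gal3_counts)
```

With these inputs the weighted mean comes out at about 93.6%, dominated by the two most populated bins (20<i'<21 and 21<i'<22).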
Table 3. Results of the self-check of the purely optical classifier (u*, g', r', and i' only). Columns correspond to the true
(spectroscopically classified) galaxies, AGNs, and stars. Rows correspond to objects classified as galaxies, AGNs, and stars by our
classifier. The values on the diagonal of each magnitude block (same SVM and true class) correspond to the correctly classified objects
in the defined i'-based apparent magnitude bins. Ratios of classified objects are given in percent.

                   |       19<i'<20       |       20<i'<21       |       21<i'<22       |      22<i'<22.5
Total Accuracy     |        85.01%        |        87.38%        |        85.09%        |        88.09%
SVM/true           | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star
Number of sources  | 1 884   1 520  2 232 | 5 483   4 440  4 440 | 6 778   5 440  5 440 | 2 126   1 760  2 232
Galaxy             | 88.82   15.70  10.98 | 92.10    6.23  15.06 | 88.39   15.50  10.01 | 93.18   17.47   3.00
AGN                |  4.45   69.45  10.23 |  3.28   90.88   4.48 |  4.04   81.54   3.81 |  4.37   79.06   3.28
Star               |  6.73   14.85  78.79 |  4.62    2.89  80.46 |  7.57    2.96  86.19 |  2.46    3.47  93.72


Table 4. Results of the self-check of the classifier with the near-infrared data (u*, g', r', i', z', and Ks). Columns correspond to the
true (spectroscopically classified) galaxies, AGNs, and stars. Rows correspond to objects classified as galaxies, AGNs, and stars by
our classifier. The values on the diagonal of each magnitude block (same SVM and true class) correspond to the correctly classified
objects in the defined i'-based apparent magnitude bins. Ratios of classified objects are given in percent.

                   |       19<i'<20       |       20<i'<21       |       21<i'<22       |      22<i'<22.5
Total Accuracy     |        95.47%        |        95.83%        |        94.28%        |        94.58%
SVM/true           | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star
Number of sources  | 1 884   1 520  2 232 | 5 483   4 440  4 440 | 6 778   5 440  5 440 | 2 126   1 760  2 232
Galaxy             | 96.28    2.90   1.27 | 97.61    1.95   0.44 | 97.11    5.00   2.10 | 96.10    6.09   1.57
AGN                |  2.44   95.91   1.70 |  1.95   96.34   0.80 |  2.52   94.83   0.77 |  3.38   92.94   1.30
Star               |  1.28    1.19  96.37 |  0.44    0.27  97.25 |  0.37    0.17  97.13 |  0.52    0.97  97.13




Fig. 9. Total Accuracy for the optical and optical+NIR classifiers
(see Tabs. 3 and 4). Results for the optical classifier based on
the u*, g', r', and i' filters are marked as a dotted line. The
solid line corresponds to the Total Accuracy of the optical+NIR
classifier.

Fig. 10. The accuracy of the optical and optical+NIR classifiers
for VIPERS galaxies with VIPERSzflag equal to 3 (GAL3). Results
for the classifier based on the u*, g', r', and i' filters only are
marked as a dotted line. The solid line corresponds to the
classifier with the NIR data (u*, g', r', i', z', and Ks).



We can observe the trend that galaxies have an increased risk 
of being mis-classified as AGNs in the faintest magnitude bins. 
One possible explanation of this behaviour is the decrease of the 
quality of the photometry for the less luminous sources, which 
have a lower signal-to-noise ratio. On the other hand, the limiting
magnitude of CFHTLS is much deeper than the VIPERS one,
and photometry should still be fairly good down to i' = 22.5 mag.
Another explanation could be that some of these galaxies are 
hosting faint AGNs which were not previously recognized, since 
with the decreasing luminosity the host galaxy becomes dimmer 
and the AGN component becomes more significant. This possi- 
bility will be further examined in future works. 



5.4. Comparison of the classifiers 

In Fig. 9 we compare the Total Accuracy for the optical and
optical+NIR classifiers. On average, the classifier based on the
u*, g', r', i', z', and Ks bands is 7.90% better than the
classifier trained without z' and Ks data. Moreover, the Total
Accuracy of the optical+NIR classifier decreases very weakly
with the apparent magnitude, while such a decrease is slightly
stronger for the purely optical classifier; the difference
between their Total Accuracies ranges from 6.49% in the faintest
bin (22<i'<22.5) to 10.46% in the brightest bin (19<i'<20).

The superiority of the classifier constructed with the NIR
data is confirmed by the efficiency of the correct classification of
Fig. 11. Representative colour-colour plot for all objects used
for the consistency check for VIPERS objects with a redshift
confirmation level of > 99%, with i' apparent magnitude between
19 and 22.5. Pink crosses represent galaxies with VIPERSzflag = 3.
Open blue circles correspond to the AGN sample with a redshift
confirmation level equal to or higher than 99% (VIPERSzflag equal
to 13 and 14). Open black squares correspond to the stellar sample
with VIPERSzflag equal to 3 and 4.



galaxies with VIPERSzflag equal to 3 (GAL3). Fig. 10 shows the
comparison of the accuracy of both classifiers (with and without
near-infrared data) for the GAL3 sample. For the fainter objects
(21<i'<22), the efficiency decreases rapidly for the classifier
trained without the z' and Ks bands, and much more smoothly for
the more sophisticated classifier, trained with infrared features.

We conclude that including NIR data to train the
SVM algorithm significantly improves the efficiency of the
galaxy/AGN/star classifier. It is evident that NIR features are
very important for building an effective classifier for the basic
astronomical classification of these three classes of sources.
Based on the above tests, we decided to use the classifier based
on the u*, g', r', i', z', and Ks bands in our subsequent analysis.

6. Consistency checks on VIPERS data 

6.1. VIPERS objects with redshift confirmation level of ≥99%

We now apply the optical+NIR classifier to VIPERS data only:

- galaxy sample - all GAL3 galaxies in the i' apparent magnitude
range between 19 and 22.5 mag, with the total number
of sources equal to 13 539,

- AGN sample - all AGNs detected by VIPERS, with redshift
confirmation level equal to or higher than 99%, and with i'
apparent magnitude between 19 and 22.5 (367 objects). All
of these AGNs were used to build the Training Sample (see
Sec. 4.2), which means that our classifier should know their
position in our 5-dimensional space of parameters. This is
not as worrisome as it may look, due to the high oversampling
needed for the AGN sample (more than 200% for the brightest
and the faintest apparent magnitude bins, and almost 800%
for the 20<i'<21 and 21<i'<22 bins), which significantly
dilutes the possibly peculiar characteristics of the 367 AGNs
chosen for the Training Sample,

- stellar sample - all spectroscopically detected stars, with a
confirmation level of > 99% (VIPERSzflag equal to 3 and 4), and
i' apparent magnitude between 19 and 22.5 (1 729 stars).
All stars with VIPERSzflag = 4 were used as a part of the stellar
Training Sample.

Fig. 11 shows the representative colour-colour plot for GAL3,
AGNs with VIPERSzflag equal to 13 and 14, and stars with
VIPERSzflag equal to 3 and 4, chosen for the consistency check.

For this test, all three classes of sources were divided
into four i' apparent magnitude bins (19<i'<20, 20<i'<21,
21<i'<22, and 22<i'<22.5), the same as used in the Training
Sample. Then, we applied our optical+NIR classifier to these
data.
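
The binning step can be sketched as a small Python helper. The names are hypothetical, and the treatment of objects lying exactly on a bin edge (assigned here to the fainter bin, with the outer limits excluded) is an assumption, since the text does not specify it.

```python
import bisect

BIN_EDGES = [19.0, 20.0, 21.0, 22.0, 22.5]
BIN_LABELS = ["19<i'<20", "20<i'<21", "21<i'<22", "22<i'<22.5"]


def magnitude_bin(i_mag):
    """Return the label of the i' magnitude bin of an object,
    or None if it lies outside 19 < i' < 22.5."""
    if not (BIN_EDGES[0] < i_mag < BIN_EDGES[-1]):
        return None
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, i_mag) - 1]


bright = magnitude_bin(19.5)
faint = magnitude_bin(22.3)
outside = magnitude_bin(23.0)
```

Each object would then be passed to the classifier trained for its own magnitude bin.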

Tab. 7 shows the results of the automatic classification.

The mean accuracy for galaxies, averaged over the
number of objects in each apparent magnitude bin, equals
93.60%. This result for galaxy classification shows only a
slightly lower level of efficiency (~1.50%) than the galaxy
classification obtained during the self-check of the classifier
(see Sec. 5.3). It means that the hyperspace of galaxy parameters
used for the Training Sample is well defined.

The result of the AGN classification is worse than the one
obtained during the self-check but still satisfactory. Averaging
over all magnitude bins, AGNs are correctly classified at a level
equal to 81.80%, with a significant decrease with i' apparent
magnitude between 21 and 22 mag. Stars are correctly classified at
the high mean level of 92.52%, with a significant drop for the
22<i'<22.5 apparent magnitude bin (84.47%). The performance of the
classifier in the case of AGNs may look relatively poor. However,
as already mentioned, we should keep in mind that the VIPERS
selection allows AGNs pre-classified as galaxies or stars based on
their colour properties. Given this, we should rather feel
satisfied that a high fraction of these AGNs can be separated into
a different section of the 5-dimensional hyperspace from galaxies
and stars, using an AGN Training Sample that consists of only
498 objects.

We did not find any crucial mis-classifications for the galaxy
sample. The galaxies are classified correctly at a very high
level. For the AGN sample, the contamination by true AGNs
classified as galaxies (8.17%, 7.37%, 10.46%, 14.90% for
the 19<i'<20, 20<i'<21, 21<i'<22, and 22<i'<22.5 bins,
respectively) and stars (8.96%, 10.55%, 6.79%, 9.56% for the
19<i'<20, 20<i'<21, 21<i'<22, and 22<i'<22.5 bins,
respectively) is significant. For the stellar sample, the
classifier more often mis-classified true stars as galaxies than
as AGNs. In the future development of this classifier, we will
include the morphological information as well as
emission/absorption lines, which should improve the algorithm and
increase the percentage of correctly classified sources as well.

6.2. VIPERS objects with redshift confirmation level lower 
than 99% 

We performed a classification for VIPERS objects with a
confirmation level lower than 99%. In particular, we used galaxies,
AGNs, and stars from the VIPERS database with the quality of the
measured redshift, VIPERSzflag, equal to 2 and 1. VIPERSzflag
equal to 2 means that the measured redshift is fairly secure,
with a confidence level > 95%. Objects with VIPERSzflag equal to
1 are more tentative, and their redshift measurement was based
on weak spectral features and/or continuum shape. For these
objects there is a ~50% probability that the redshift could be
wrong. A more detailed description of the VIPERSzflag and the
quality of measured redshifts can be found in Guzzo et al. (2013)
and Garilli et al. (2012).

Table 7. Results of the test of the optical+NIR classifier for GAL3, and AGNs and stars with redshift measurements at a
confirmation level ≥ 99%. AGNs from this sample were used for building the Training Sample. The rest of the objects are not related
to the Training Sample. The values on the diagonal of each magnitude block (same SVM and true class) correspond to the correctly
classified objects in the i'-based apparent magnitude bins. The ratio of the classified objects is given in percent.

                   |       19<i'<20       |       20<i'<21       |       21<i'<22       |      22<i'<22.5
SVM/true           | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star
Number of sources  |   445     69     337 | 3 271    134     428 | 7 667    127     701 | 2 156     37     263
Galaxy             | 95.38   12.52   4.17 | 95.17    7.37   3.27 | 93.09   10.46   3.42 | 92.72   14.90   9.09
AGN                |  2.42   77.34   3.70 |  2.72   82.08   3.27 |  4.30   82.75   1.43 |  5.29   75.54   6.44
Star               |  2.20   10.14  92.13 |  2.11   10.55  93.46 |  2.61    6.79  95.15 |  1.99    9.56  84.47

Table 8. Results of the optical+NIR classifier for galaxies, AGNs, and stars with redshift measurements at a confirmation level
equal to 95% (VIPERSzflag = 2). Objects are not related to the Training Sample. The values on the diagonal of each magnitude block
(same SVM and true class) correspond to the correctly classified objects in the i'-based apparent magnitude bins. The ratio of the
classified objects is given in percent.

                   |       19<i'<20       |       20<i'<21       |       21<i'<22       |      22<i'<22.5
SVM/true           | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star
Number of sources  |     8     33      10 |   945     75      27 | 5 757     80     145 | 6 226     48     159
Galaxy             | 84.15    6.25  30.00 | 94.18   12.00  22.22 | 92.81   20.25  29.65 | 58.62    4.17  17.61
AGN                |  9.75   93.75  40.00 |  3.49   88.00  29.63 |  4.41   77.22  11.03 | 20.56   93.75  18.87
Star               |  6.10    0.00  30.00 |  2.33    0.00  48.15 |  2.78    2.53  59.32 | 20.82    2.08  63.52

Table 9. Results of the optical+NIR classifier for galaxies, AGNs, and stars with redshift measurements at a confirmation
level equal to 50% (VIPERSzflag equal to 1). Objects are not related to the Training Sample. The values on the diagonal of each
magnitude block (same SVM and true class) correspond to the correctly classified objects in the i'-based apparent magnitude bins.
The ratio of the classified objects is given in percent.

                   |       19<i'<20       |       20<i'<21       |       21<i'<22       |      22<i'<22.5
SVM/true           | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star | Galaxy   AGN    Star
Number of sources  |    35      8       4 |   355     13      24 | 2 833     35      81 | 3 157     30     139
Galaxy             | 85.71   50.00  25.00 | 92.68   23.08  37.50 | 88.14   20.00  60.56 | 43.09   23.33  25.90
AGN                | 11.43   50.00  75.00 |  3.94   76.92  37.50 |  7.12   68.57   9.86 | 31.03   66.67  31.65
Star               |  2.86    0.00   0.00 |  3.38    0.00  25.00 |  4.74   11.43  29.58 | 25.88   10.00  42.45

Results of the SVM classification of objects with VIPERSzflag
equal to 2 (Tab. 8) and 1 (Tab. 9) show a very good agreement
with the previous user-supervised classifications. Galaxies
are classified in agreement with the redshift measurements at the
mean level of 76.45% for VIPERSzflag = 2 and 66.08% for
VIPERSzflag = 1.

The ongoing scientific analysis of galaxy evolution and
clustering is mainly based on objects which have secure redshift
measurements (VIPERSzflag > 2, depending on the topic). With
the SVM classification we can reconfirm the identity of galaxies
with the lower quality flags and thus increase the number
of galaxies which could be used for more detailed analysis.
This may apply to 4 735 galaxies, 58 AGNs, and 86 stars with
VIPERSzflag equal to 1 (these numbers were calculated as the sum
of galaxies, AGNs, and stars which were assigned to the same
class of objects during redshift validation and by the optical+NIR
classifier, i.e. the correctly classified objects in Tab. 9). This
method may also reconfirm the class of 9 952 galaxies, 177 AGNs
and 160 stars with VIPERSzflag = 2, classified as galaxies, AGNs,
and stars both by two different observers and by our classifier.
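
The way such totals follow from the tables can be illustrated for the AGNs with VIPERSzflag equal to 1. The Python sketch below multiplies the number of AGNs in each magnitude bin by the percentage of them that the SVM also classifies as AGNs (both taken from Tab. 9) and rounds to whole objects; the function name is hypothetical.

```python
def reconfirmed(n_sources, percent_correct):
    """Sum over bins of N * (percentage of correctly classified objects) / 100,
    rounded to whole objects in each bin."""
    return sum(round(n * p / 100.0) for n, p in zip(n_sources, percent_correct))


agn_counts = [8, 13, 35, 30]                   # AGNs per i' bin (Tab. 9)
agn_correct = [50.00, 76.92, 68.57, 66.67]     # AGNs classified as AGNs, percent (Tab. 9)
n_agn = reconfirmed(agn_counts, agn_correct)   # 4 + 10 + 24 + 20 = 58
```

This reproduces the 58 AGNs quoted above.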

One may argue that the VIPERSzflag is related to the redshift
value, not the identification of the galaxy itself. However, it
should be noted that a majority of sources with low VIPERSzflag
are absorption line systems with noisy, low signal-to-noise
spectra. Galaxies with such spectra can be particularly easily
mis-classified as stars during the spectroscopic measurement
process, either automatic or human-supervised. For instance,
typical features of an elliptical galaxy at z~1, around the
Balmer break, can be confused with characteristic features of an
M-type star. In such cases, an independent confirmation that the
position of an object in the 5-dimensional colour space is
typical for a galaxy also increases the probability that its
redshift has been assigned correctly.

The number of stars and AGNs in the sample of objects with
VIPERS zflag equal to 1 and 2 is very low (a few objects in the
brightest apparent magnitude bin, and a few dozen for objects
with i' magnitude lower than 21 mag). This results from
the initial star/galaxy separation performed by VIPERS. In the
VIPERS database the stars which remained after the colour-
colour pre-selection are not typical, and they occupy a similar
area to galaxies on the colour-colour plots. The fact that
we can reconfirm the identity of a significant fraction of them
can therefore already be regarded as a success. The spectra of stars with
VIPERS zflag = 1 or 2 which were classified as galaxies by our
classifier will have to be re-examined, since some of them might
be genuine galaxies.

We should consider the fact that the VIPERS galaxy sample
includes different types of AGNs which are classified as galaxies
in the VIPERS database, even among objects with VIPERS zflag
higher than 3. During the standard redshift measurement pro-
cess only the broad line AGNs are recognized and flagged.
This implies that our galaxy Training Sample contains, in addition
to a pure sample of normal galaxies, specific types of AGNs that are
otherwise difficult to recognize in VIPERS spectra during the stan-
dard redshift measurement process. The VIPERS search for these spe-
cific types of AGNs hidden among the collected sources
is still ongoing; hence, at least for now, we are forced to work
with data composed of both galaxies and AGNs. The contam-
ination of the galaxy sample by AGNs is most prominent for the faintest
bin (22 < i' < 22.5), where more than 20% (30%) of objects clas-
sified as galaxies with VIPERS zflag equal to 2 (1) are identified
as AGNs by the optical+NIR SVM classifier.

We look forward to extending the SVM method with further in-
formation on spectral lines and source morphologies, which is a very
promising way to improve the classification of fainter sources and
to refine the classes of objects that the software can discrim-
inate.

7. Conclusions 

Application of the Support Vector Machine algorithm can de-
liver an excellent classification of three classes of objects (with
self-check accuracy higher than 98% for galaxies, 94% for AGNs,
and 93% for stars), provided that the Training Sample is carefully
selected. For our analysis we constructed two
classifiers, with and without near-infrared data, using a multi-
dimensional colour hyperspace. Part of the AGN and star sam-
ples was extracted from the VVDS survey. We have found a sig-
nificant improvement of the SVM classification (8% in the Total
Accuracy of the classifier) when adding NIR colour parameters to our
feature vectors.
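To make the difference between the two classifiers concrete, the sketch below builds an optical-only and an optical+NIR colour vector from a set of magnitudes. It assumes that colours are differences of adjacent bands in the order u, g, r, i, z, Ks, which yields three and five colours respectively; the actual colour combinations are defined elsewhere in the paper and may differ. The band keys, the function name `colour_vector`, and the magnitudes are illustrative only.

```python
# Assumed band ordering; the keys are illustrative labels, not catalogue columns.
OPTICAL_BANDS = ("u", "g", "r", "i")
OPTICAL_NIR_BANDS = OPTICAL_BANDS + ("z", "Ks")

def colour_vector(mags, bands):
    """Return adjacent-band colours (m_a - m_b) for the given band sequence."""
    return [mags[a] - mags[b] for a, b in zip(bands, bands[1:])]

# Hypothetical magnitudes for a single source.
mags = {"u": 24.5, "g": 23.25, "r": 22.5, "i": 22.0, "z": 21.5, "Ks": 20.0}

optical = colour_vector(mags, OPTICAL_BANDS)          # 3 colours
optical_nir = colour_vector(mags, OPTICAL_NIR_BANDS)  # 5 colours
print(optical)      # [1.25, 0.75, 0.5]
print(optical_nir)  # [1.25, 0.75, 0.5, 0.5, 1.5]
```

Under this assumption the optical+NIR vector simply extends the optical one, so any gain in accuracy can be attributed to the two additional colours.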

For the optical+NIR classifier we obtained a very good
agreement (93.60%, 81.80%, and 92.52% for galaxies, AGNs,
and stars, respectively) with the VIPERS spectroscopic sam-
ple with a flag confidence level of z measurements equal to 95%.
What distinguishes our approach to SVM classification is
that, thanks to the large amount of excellent-quality data, we
could train the classifier on part of the
most secure sources and then test it against the remaining secure
objects, obtaining an efficient pattern recognition system.
The VIPERS survey gathered a large number of sources (55 358)
with very good spectroscopic measurements, which were then
strictly analysed to obtain the most secure redshifts. This al-
lowed us to choose the best sample, which could be used as
a base for the new method of automatic classification.
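The train-then-test procedure described above can be sketched as a stratified hold-out split followed by a per-class accuracy measurement. The split fraction, the random seed, and the function names are assumptions made for illustration; the classifier itself is left out, and any SVM implementation could supply the predicted labels.

```python
import random

def stratified_split(labels, train_fraction=0.5, seed=0):
    """Split indices into train/test sets, keeping the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(train_fraction * len(idxs)))
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)

def per_class_accuracy(true_cls, pred_cls):
    """Fraction of sources of each true class that were recovered."""
    hits, totals = {}, {}
    for t, p in zip(true_cls, pred_cls):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Toy labels (hypothetical) standing in for the secure spectroscopic sample.
labels = ["Galaxy"] * 6 + ["AGN"] * 2 + ["Star"] * 2
train_idx, test_idx = stratified_split(labels, train_fraction=0.5)

# Toy predictions (hypothetical) on four test sources.
acc = per_class_accuracy(["Galaxy", "Galaxy", "AGN", "Star"],
                         ["Galaxy", "AGN", "AGN", "Star"])
print(acc)  # {'Galaxy': 0.5, 'AGN': 1.0, 'Star': 1.0}
```

Stratifying by class matters here because AGNs and stars are far rarer than galaxies, and an unstratified split could leave a test set with almost none of them.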

SVM classifiers are mostly used in the literature for the separa-
tion of two classes of sources (e.g. stars and galaxies). The only
recent application of the SVM to galaxy/AGN/star classi-
fication was performed by Saglia et al. (2012), who trained and
used their classifier on the Pan-STARRS1 data. Comparing the
accuracies of our classifier with those of Saglia et al. (2012), we
found that our self-check results look somewhat better (97%,
95%, 97% vs 97%, 84%, 85% for galaxies, AGNs, and stars for the
VIPERS and Pan-STARRS1 classifiers, respectively). However,
we have to stress that the two methods cannot be directly compared
because of the initial differences between the two surveys. Pan-STARRS1 is
a purely limited survey, which implies a much higher variety of
properties of all the sources it contains. In contrast, VIPERS was
pre-selected to contain only 0.5 < z < 1.2 galaxies, which ensures
that they form a much more distinct and better separated group
in a multi-colour space. This may facilitate the separation between
galaxies and AGNs, as well as a part of the stars which were re-
introduced to the VIPERS target sample as AGN candidates. On
the other hand, the lack of 'typical' stars in the VIPERS database
(rejected by the colour and half-light radius pre-selection) means that
the remaining stars occupy the same colour-colour space as galaxies,
which may hamper our classification based only on colours and
decrease the efficiency of our classifier for sources from the real sample.

Our approach allows us to photometrically classify sources
in the VIPERS survey, augmenting the spectral information. By
classifying the sources with low-quality spectra, we can improve
the classification, enlarging the samples that may be used for
analysis. Using the optical+NIR classifier, we have confirmed
the class of almost 4 900 objects with a low flag. Adding
morphology and emission/absorption line information to our
classifier should further improve the already very
good performance of the galaxy/AGN/star classifier. It will also al-
low for the development of more specific classifications of galaxy
and AGN types.